Ordination

Rodney Dyer, PhD

Overall Impetus

There are many times when we have several columns of data recorded on individual observations.

  • Genotypes of individuals from several populations
  • Species counts across sampling locations
  • Climatic data (e.g., water/temperature) measured at several locations

Consequences

As a consequence, we may have problems:

  • Visualizing more than 2-3 dimensions of the data
  • Understanding which subsets of the data are correlated (and thus redundant)
  • Separating signal from noise

Are there methods for visualization and quantification of data like this?

Eigen Structure

Eigen Deconstruction

A method to factor high-dimensional data into additive subcomponents

Just like you can factor the equation -6x^2 + 5x + 4 = 0 into the factors (2x+1)(-3x+4), large data sets with N rows and K columns of data can be factored based upon their column-wise mean values, variances, and covariances between columns of data.

Way Cool Matrix Algebra

Consider the matrix of data X with N rows and K columns. The variance of each of the K data columns, and the covariances between them, can be represented as a KxK covariance matrix, which is derived from this fancy formula, where X_c is X with its column means subtracted.

S = \frac{1}{N-1} X_c^\prime X_c

 

S = \left[ \begin{array}{cccc} \sigma_A^2 & \sigma_{AB} & \ldots & \sigma_{AK} \\ \sigma_{BA} & \sigma_{B}^2 & \ldots & \sigma_{BK} \\ \sigma_{CA} & \sigma_{CB} & \ldots & \sigma_{CK} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{KA} & \sigma_{KB} & \ldots & \sigma_{K}^2 \\ \end{array}\right]
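
A minimal R sketch of how such a matrix can be computed by hand: center each column on its mean, take the cross-product, and divide by N - 1. The data are simulated and the column names A, B, and C are invented for illustration; the result is checked against R's built-in cov().

```r
# Simulated data: N = 50 observations on K = 3 variables (values are made up).
set.seed(42)
N <- 50
X <- cbind(A = rnorm(N), B = rnorm(N), C = rnorm(N))
X[, "B"] <- X[, "B"] + 0.8 * X[, "A"]   # force A and B to covary

# Center each column on its mean, then form the scaled cross-product.
Xc <- scale(X, center = TRUE, scale = FALSE)
S  <- t(Xc) %*% Xc / (N - 1)

# Variances sit on the diagonal, covariances off the diagonal.
print(round(S, 3))

# The built-in cov() should give the same K x K matrix.
print(all.equal(S, cov(X), check.attributes = FALSE))
```

Because the covariance of A with B equals the covariance of B with A, the matrix is symmetric, so only the diagonal and one triangle carry unique information.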

Partitioning Variation & Covariation

So we can partition this matrix as:

S = \sum_{i=1}^K \lambda_{i} \ell^\prime_i \ell_i

Where:

  • \lambda_i is a scaling number (the i-th eigenvalue), and

  • \ell_i is a 1xK vector of values (the i-th eigenvector).
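
The partition can be checked numerically. The sketch below, on simulated data (not the data used in these slides), uses base R's eigen() to obtain the \lambda_i and \ell_i and then rebuilds S one additive component at a time.

```r
# Simulated data: 50 observations on K = 4 variables (values are made up).
set.seed(42)
X <- matrix(rnorm(200), ncol = 4)
S <- cov(X)

# e$values holds the lambda_i; each column of e$vectors is one ell_i.
e <- eigen(S, symmetric = TRUE)

# Rebuild S as the sum of lambda_i * t(ell_i) %*% ell_i.
S_rebuilt <- matrix(0, nrow(S), ncol(S))
for (i in seq_along(e$values)) {
  ell_i <- matrix(e$vectors[, i], nrow = 1)   # a 1 x K row vector
  S_rebuilt <- S_rebuilt + e$values[i] * t(ell_i) %*% ell_i
}

print(all.equal(S, S_rebuilt))

# The eigenvalues add up to the total variance (the trace of S).
print(c(sum(e$values), sum(diag(S))))
```

Since eigen() returns the eigenvalues in decreasing order, the first few terms of the sum account for the largest share of the variance, which is what makes dropping the later terms a reasonable way to reduce dimensionality.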

Principal Component Rotations

Consider the following data

Marginal Distributions

 

 

 

 

 

Creating Orthogonal Data

The transformation you are doing is based upon applying a linear transformation (a rotation) to the original data, moving it from its previous space into an identically sized new space whose axes are uncorrelated with one another.
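
As a rough illustration on simulated data (the variables x and y are invented here), the sketch below uses prcomp() and shows that the new coordinates are just the centered data multiplied by a rotation matrix, that the shape of the data is unchanged, and that the new axes are uncorrelated.

```r
# Two correlated variables (simulated for illustration).
set.seed(42)
x <- rnorm(100)
y <- 0.7 * x + rnorm(100, sd = 0.5)
X <- cbind(x, y)

pc <- prcomp(X, center = TRUE, scale. = FALSE)

# The scores are the centered data times the rotation matrix.
Xc     <- scale(X, center = TRUE, scale = FALSE)
scores <- Xc %*% pc$rotation
print(all.equal(scores, pc$x, check.attributes = FALSE))

# Same number of rows and columns as the original data.
print(dim(X))
print(dim(pc$x))

# The original columns are correlated; the rotated ones are not.
print(round(cor(X), 2))
print(round(cor(pc$x), 10))
```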

 

[1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"    

 

Importance of components:
                          Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.8113021 2.19949725 1.98692071 1.76188725 1.35153653
Proportion of Variance 0.1362659 0.08341014 0.06806645 0.05352149 0.03149398
Cumulative Proportion  0.1362659 0.21967599 0.28774244 0.34126393 0.37275792
                          Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     1.3052912 1.24832072 1.23585320 1.20816941 1.16573700
Proportion of Variance 0.0293756 0.02686732 0.02633333 0.02516678 0.02343005
Cumulative Proportion  0.4021335 0.42900084 0.45533417 0.48050095 0.50393100
                         Comp.11    Comp.12    Comp.13    Comp.14    Comp.15
Standard deviation     1.1479296 1.12805147 1.11077635 1.09681605 1.07210090
Proportion of Variance 0.0227197 0.02193966 0.02127283 0.02074147 0.01981725
Cumulative Proportion  0.5266507 0.54859035 0.56986318 0.59060466 0.61042190
                          Comp.16    Comp.17    Comp.18    Comp.19   Comp.20
Standard deviation     1.06907461 1.06258841 1.05051763 1.03671883 1.0197660
Proportion of Variance 0.01970553 0.01946714 0.01902737 0.01853079 0.0179297
Cumulative Proportion  0.63012743 0.64959457 0.66862194 0.68715273 0.7050824
                          Comp.21   Comp.22   Comp.23    Comp.24    Comp.25
Standard deviation     1.00291893 0.9913325 0.9779643 0.96869623 0.95675236
Proportion of Variance 0.01734218 0.0169438 0.0164899 0.01617883 0.01578233
Cumulative Proportion  0.72242461 0.7393684 0.7558583 0.77203714 0.78781947
                          Comp.26    Comp.27    Comp.28    Comp.29    Comp.30
Standard deviation     0.94329924 0.93880305 0.91823416 0.89529206 0.87049453
Proportion of Variance 0.01534161 0.01519571 0.01453714 0.01381979 0.01306484
Cumulative Proportion  0.80316108 0.81835679 0.83289393 0.84671372 0.85977856
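
The three rows of this table are related by simple arithmetic: each proportion is a squared standard deviation divided by the sum of all squared standard deviations. In the table above, for instance, Comp.1 gives 2.8113^2 ≈ 7.90, and 7.90 / 0.1363 ≈ 58, so the total variance appears to be about 58. The sketch below rebuilds the same three rows from a princomp() fit on simulated data (not the data used in these slides); the use of cor = TRUE is an assumption made for the example.

```r
# Simulated data: 100 observations on 5 variables (values are made up).
set.seed(42)
X <- matrix(rnorm(500), ncol = 5)

# cor = TRUE analyzes the correlation matrix, so the total variance is 5.
fit <- princomp(X, cor = TRUE)

variances  <- fit$sdev^2
proportion <- variances / sum(variances)
cumulative <- cumsum(proportion)

# Same layout as the "Importance of components" table from summary(fit).
print(rbind("Standard deviation"     = fit$sdev,
            "Proportion of Variance" = proportion,
            "Cumulative Proportion"  = cumulative))
```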

 

 

Visualization

Principal Components Analysis on Frequencies

Just like working on raw data, but coalescing all the individuals into single populations defined by allele frequency matrices.

  Stratum AML-01 AML-02 AML-03    AML-04    AML-05 AML-06 AML-07 AML-08 AML-09
1     101   0.00      0      0 0.0000000 0.0000000   0.00   0.00   0.50   0.00
2     102   0.00      0      0 0.0000000 0.0000000   0.00   0.00   0.00   0.00
3      12   0.05      0      0 0.0000000 0.0000000   0.05   0.35   0.50   0.00
4     153   0.00      0      0 0.0000000 0.0000000   0.00   0.60   0.35   0.05
5     156   0.00      0      0 0.6666667 0.3333333   0.00   0.00   0.00   0.00
6     157   0.00      0      0 0.7000000 0.1000000   0.20   0.00   0.00   0.00
  AML-10 AML-11 AML-12 AML-13 ATPS-01   ATPS-02   ATPS-03   ATPS-04 ATPS-05
1   0.00    0.5      0      0       0 0.6666667 0.0000000 0.1111111    0.00
2   0.00    1.0      0      0       0 0.9375000 0.0000000 0.0000000    0.00
3   0.05    0.0      0      0       0 0.0000000 0.0000000 0.0000000    1.00
4   0.00    0.0      0      0       0 0.0000000 0.0000000 0.0000000    1.00
5   0.00    0.0      0      0       0 0.0000000 0.9166667 0.0000000    0.00
6   0.00    0.0      0      0       0 0.0000000 0.7000000 0.0000000    0.15

 

Importance of components:
                          PC1    PC2    PC3     PC4    PC5     PC6     PC7
Standard deviation     0.9961 0.7719 0.5274 0.39514 0.2898 0.25543 0.24065
Proportion of Variance 0.4135 0.2483 0.1159 0.06508 0.0350 0.02719 0.02414
Cumulative Proportion  0.4135 0.6618 0.7777 0.84282 0.8778 0.90501 0.92914
                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     0.19880 0.15834 0.14287 0.13783 0.12481 0.10191 0.09150
Proportion of Variance 0.01647 0.01045 0.00851 0.00792 0.00649 0.00433 0.00349
Cumulative Proportion  0.94562 0.95606 0.96457 0.97249 0.97898 0.98331 0.98680
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     0.08413 0.07641 0.07166 0.05890 0.05077 0.03845 0.03744
Proportion of Variance 0.00295 0.00243 0.00214 0.00145 0.00107 0.00062 0.00058
Cumulative Proportion  0.98975 0.99218 0.99432 0.99577 0.99685 0.99746 0.99805
                          PC22    PC23    PC24    PC25    PC26    PC27    PC28
Standard deviation     0.03216 0.02974 0.02461 0.02256 0.01880 0.01789 0.01682
Proportion of Variance 0.00043 0.00037 0.00025 0.00021 0.00015 0.00013 0.00012
Cumulative Proportion  0.99848 0.99884 0.99910 0.99931 0.99946 0.99959 0.99971
                          PC29    PC30    PC31     PC32     PC33     PC34
Standard deviation     0.01469 0.01358 0.01061 0.007838 0.006974 0.005382
Proportion of Variance 0.00009 0.00008 0.00005 0.000030 0.000020 0.000010
Cumulative Proportion  0.99980 0.99987 0.99992 0.999950 0.999970 0.999980
                           PC35     PC36     PC37     PC38      PC39
Standard deviation     0.004322 0.003937 0.003217 0.002008 4.423e-16
Proportion of Variance 0.000010 0.000010 0.000000 0.000000 0.000e+00
Cumulative Proportion  0.999990 0.999990 1.000000 1.000000 1.000e+00
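
The last component in the table above has a standard deviation of essentially zero (about 4e-16). One plausible reading, assuming the analysis had 39 populations as rows, is that centering the columns removes one dimension, so at most 38 components can carry any variance. The sketch below reproduces that behavior with prcomp() on a small simulated matrix; the sizes and values are invented.

```r
# Simulated "frequency" matrix: 6 populations (rows) by 10 alleles (columns).
set.seed(42)
n_pops    <- 6
n_alleles <- 10
freqs <- matrix(runif(n_pops * n_alleles), nrow = n_pops)

fit <- prcomp(freqs, center = TRUE)

# prcomp() reports min(rows, columns) components ...
print(length(fit$sdev))

# ... but after centering only n_pops - 1 of them can be non-zero.
print(signif(fit$sdev, 3))
```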

 

PCA on Frequencies

Detailed Visualizations

 

Principal Coordinate Analysis

Like PCA but using distance matrices instead of raw data.

[1] 39 39

 

Importance of components:
                          PC1    PC2     PC3     PC4     PC5    PC6    PC7
Standard deviation     3.4963 2.3244 1.38995 0.76870 0.62286 0.5129 0.4473
Proportion of Variance 0.5622 0.2485 0.08884 0.02717 0.01784 0.0121 0.0092
Cumulative Proportion  0.5622 0.8106 0.89946 0.92664 0.94448 0.9566 0.9658
                           PC8     PC9    PC10    PC11    PC12   PC13    PC14
Standard deviation     0.39332 0.31379 0.26270 0.23524 0.20290 0.1976 0.18482
Proportion of Variance 0.00711 0.00453 0.00317 0.00254 0.00189 0.0018 0.00157
Cumulative Proportion  0.97289 0.97742 0.98059 0.98314 0.98503 0.9868 0.98839
                          PC15    PC16    PC17    PC18    PC19    PC20    PC21
Standard deviation     0.18292 0.16247 0.14794 0.14137 0.13605 0.12182 0.11651
Proportion of Variance 0.00154 0.00121 0.00101 0.00092 0.00085 0.00068 0.00062
Cumulative Proportion  0.98993 0.99115 0.99215 0.99307 0.99392 0.99461 0.99523
                          PC22   PC23    PC24    PC25    PC26    PC27    PC28
Standard deviation     0.11066 0.1039 0.10234 0.09489 0.08724 0.08436 0.07748
Proportion of Variance 0.00056 0.0005 0.00048 0.00041 0.00035 0.00033 0.00028
Cumulative Proportion  0.99579 0.9963 0.99677 0.99719 0.99754 0.99786 0.99814
                          PC29    PC30    PC31    PC32    PC33    PC34    PC35
Standard deviation     0.07707 0.07387 0.06873 0.06740 0.06523 0.06105 0.05838
Proportion of Variance 0.00027 0.00025 0.00022 0.00021 0.00020 0.00017 0.00016
Cumulative Proportion  0.99841 0.99866 0.99888 0.99909 0.99929 0.99946 0.99961
                          PC36    PC37    PC38      PC39
Standard deviation     0.05684 0.05523 0.04602 3.946e-16
Proportion of Variance 0.00015 0.00014 0.00010 0.000e+00
Cumulative Proportion  0.99976 0.99990 1.00000 1.000e+00
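
The table above appears to come from summarizing a principal components fit on the 39 x 39 distance matrix. A different route, swapped in here for illustration, is base R's cmdscale(), which performs classical multidimensional scaling and is commonly what is meant by principal coordinate analysis. The sketch below uses simulated data and shows that, for Euclidean distances, the coordinates match the PCA scores up to the sign of each axis.

```r
# Simulated data: 8 objects measured on 5 variables (values are made up).
set.seed(42)
X <- matrix(rnorm(8 * 5), nrow = 8)

# Pairwise Euclidean distances; as a full matrix this is 8 x 8.
D <- dist(X)
print(dim(as.matrix(D)))

# Classical scaling into 2 dimensions.
pcoa   <- cmdscale(D, k = 2, eig = TRUE)
coords <- pcoa$points

# Compare with the first two PCA axes of the raw data.
pca_scores <- prcomp(X)$x[, 1:2]
print(abs(diag(cor(coords, pca_scores))))
```

The practical difference is the input: cmdscale() needs only the distances, so it can be used with genetic or ecological distance measures for which no raw data matrix exists.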

 

Hierarchical Clustering

Clustering

A technique to build a representation of similarity between objects.

  • Supervised

  • Unsupervised

  • Individual or Group Based

From www.nature.com/articles/s41467-020-20507-3

 

Help File for hclust

Visualizing From Distance Views

hclust() requires that matrix objects actually be turned into dist objects (which store only the lower triangle of a symmetric distance matrix).

         101      102       12      153      156      157
102 2.048994                                             
12  3.972442 4.342952                                    
153 4.099369 4.364062 1.860651                           
156 4.727214 4.754565 4.901142 4.871141                  
157 4.541334 4.629884 4.510097 4.532973 1.073274         
159 3.733735 4.047019 2.537027 3.302282 4.434527 4.121070
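
A small sketch of the conversion, using as.dist() on a made-up 3 x 3 matrix (the labels p1, p2, and p3 and the distances are invented for illustration).

```r
# A symmetric distance matrix with zeros on the diagonal (values are made up).
labels <- c("p1", "p2", "p3")
m <- matrix(c(0, 2, 4,
              2, 0, 3,
              4, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(labels, labels))

# Convert to a dist object; only the lower triangle is kept.
d <- as.dist(m)
print(class(d))
print(length(d))   # n * (n - 1) / 2 pairwise distances
print(d)

# Converting back restores the full square matrix.
print(as.matrix(d))
```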

Visualizing From Distance Views


Call:
hclust(d = d)

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 39 
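
A minimal end-to-end sketch on simulated data: build two well-separated groups, compute distances, cluster them with hclust(), and cut the tree into two groups with cutree(). The object names and values are invented for illustration and are not the data summarized above.

```r
# Two groups of 5 objects each, far apart in two dimensions (simulated).
set.seed(42)
X <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
           matrix(rnorm(10, mean = 8), ncol = 2))
rownames(X) <- paste0("pop", 1:10)

d <- dist(X)     # Euclidean distance by default
h <- hclust(d)   # complete linkage by default
print(h)

# Cut the dendrogram into two clusters and tabulate membership.
groups <- cutree(h, k = 2)
print(table(groups))

# plot(h) would draw the dendrogram.
```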

Interactive Plots

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.
